Practice2-Fraud¶

  • Author: Francisco Martínez García

In this practice, we analyse the correlation between certain variables and payment fraud events.

First, we will analyse, select and make the necessary changes to the data. Afterwards, we will create several prediction models in order to select the best one. Once this selection is done, the explainability of the results will be studied.

The data was provided by the professors of the course.

Library import¶

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px

import pickle

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings('ignore')

#Functions document
import functions as fx

Data Reading¶

In [2]:
#Read the data and edit the ',' in numeric data
pd_fraud = (pd.read_csv("../data/Original_dataset_payments_fraud.csv", delimiter=";")
    .assign(**{'connection_time': lambda df: df['connection_time'].str.replace(',', '.').astype(float)})
)
In [3]:
pd_fraud.head()
Out[3]:
step type amount gender device connection_time nameOrig race oldbalanceOrg age newbalanceOrig zone user_number nameDest user_connections security_alert oldbalanceDest newbalanceDest isFraud
0 1 PAYMENT 9839.64 man mac 0.140039 C1231006815 black 170136.0 85 160296.36 capital 138 M1979787155 5 1 0.0 0.0 0
1 1 PAYMENT 1864.28 woman mac 0.496890 C1666544295 asian 21249.0 57 19384.72 country 909 M2044282225 1 0 0.0 0.0 0
2 1 TRANSFER 181.00 man pc 0.781150 C1305486145 asian 181.0 66 0.00 capital 2569 C553264065 10 0 0.0 0.0 1
3 1 CASH_OUT 181.00 man mac 0.565068 C840083671 black 181.0 31 0.00 country 1787 C38997010 3 0 21182.0 0.0 1
4 1 PAYMENT 11668.14 unknow mac 0.517114 C2048537720 black 41554.0 90 29885.86 country 3997 M1230701703 8 0 0.0 0.0 0

We delete the gender and race variables so that the model cannot discriminate against anyone. Furthermore, nameOrig and nameDest are dropped too, as these variables only identify the user and are not relevant for this study.

In [4]:
pd_fraud = pd_fraud.drop(["gender", "race", "nameOrig", "nameDest"], axis=1)
In [5]:
#Duplicates
print(pd_fraud.shape, pd_fraud.drop_duplicates().shape)
(1048575, 15) (1048575, 15)

There are no duplicates in this dataset.

Target exploration¶

In [6]:
#Divide the target between frauds and no frauds
pd_plot_fraud_status = pd_fraud['isFraud']\
        .value_counts(normalize=True)\
        .mul(100).rename('percent').reset_index()

pd_plot_fraud_status_conteo = pd_fraud['isFraud'].value_counts().reset_index()
pd_plot_fraud_status_pc = pd.merge(pd_plot_fraud_status, pd_plot_fraud_status_conteo, 
                                   on=['index'], how='inner')
pd_plot_fraud_status_pc
Out[6]:
index percent isFraud
0 0 99.89109 1047433
1 1 0.10891 1142
In [7]:
#Plot the results
fig = px.histogram(pd_plot_fraud_status_pc, x="index", y=['percent'])
fig.show()

As can be seen, the target is highly unbalanced: 99.89% of the records are not fraud, while only 0.11% are.
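SMOTE is imported above but not applied in this notebook. As a simpler hedged sketch on a hypothetical toy frame (names and values are illustrative), minority oversampling with `sklearn.utils.resample` would look like this:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy frame with a ~1% positive class, mirroring the notebook's imbalance
rng = np.random.default_rng(0)
df = pd.DataFrame({"amount": rng.random(1000),
                   "isFraud": [1] * 10 + [0] * 990})

minority = df[df["isFraud"] == 1]
majority = df[df["isFraud"] == 0]

# Oversample the minority class with replacement until both classes match
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=1)
df_balanced = pd.concat([majority, minority_up])
print(sorted(df_balanced["isFraud"].value_counts().to_dict().items()))  # [(0, 990), (1, 990)]
```

Either way, oversampling should only ever be applied to the training split, never to the test data.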

Null treatment¶

In [10]:
#Calculation of the null data
pd_series_null_columns = pd_fraud.isnull().sum().sort_values(ascending=False)

pd_null_columnas = pd.DataFrame(pd_series_null_columns, columns=['nulos_columnas'])     

Columns¶

In [11]:
pd_null_columnas
Out[11]:
nulos_columnas
device 104580
zone 104414
step 0
type 0
amount 0
connection_time 0
oldbalanceOrg 0
age 0
newbalanceOrig 0
user_number 0
user_connections 0
security_alert 0
oldbalanceDest 0
newbalanceDest 0
isFraud 0

The device and zone columns each have around 10% null values. This number is very significant, but we decide to keep them for the analysis, since the absence of these answers may give us useful information when predicting.
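One way to keep that missingness signal is to turn the nulls into an explicit category before encoding. A minimal hedged sketch on a hypothetical mini-frame (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical mini-frame with nulls in device and zone, as in the dataset
df = pd.DataFrame({"device": ["mac", None, "pc", None],
                   "zone": ["capital", "country", None, "country"]})

# Encode missingness as its own level so the one-hot step keeps the signal
df_filled = df.fillna({"device": "missing", "zone": "missing"})
print(df_filled["device"].tolist())  # ['mac', 'missing', 'pc', 'missing']
```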

Assignment of numerical and categorical variables¶

In [12]:
#Selection of the categorical and numerical variables
list_cat_var, other = fx.dame_variables_categoricas(dataset=pd_fraud)
pd_fraud[list_cat_var] = pd_fraud[list_cat_var].astype("category")
list_numeric_var = list(pd_fraud.select_dtypes(include=('int','float')).columns)
pd_fraud[list_numeric_var] = pd_fraud[list_numeric_var]#.astype(float)

pd_fraud.dtypes
Out[12]:
step                   int64
type                category
amount               float64
device              category
connection_time      float64
oldbalanceOrg        float64
age                    int64
newbalanceOrig       float64
zone                category
user_number            int64
user_connections       int64
security_alert      category
oldbalanceDest       float64
newbalanceDest       float64
isFraud             category
dtype: object
In [13]:
pd_fraud[list_cat_var].dtypes
Out[13]:
type              category
device            category
zone              category
security_alert    category
isFraud           category
dtype: object

Variables' graphs¶

In [14]:
#Plot the histogram for variable device
fx.histogram(pd_fraud, 'device')

The most used device is the Mac, followed by the PC and the iPhone.

In [15]:
#Plot the histogram for variable type
fx.histogram(pd_fraud, 'type')

As can be seen in the graph above, the most frequent transaction types are payments and cash-outs. Debits are very rare compared with the other options.

In [16]:
#Plot the boxplot for variable amount
fx.box_plot(pd_fraud, 'amount')

The transaction amount is much higher in the cases where payment fraud occurred.

In [17]:
#Plot the boxplot for variable age
fx.box_plot(pd_fraud, 'age')

At first sight, the age boxplots look very similar whether fraud occurs or not.

Outliers¶

In [18]:
#Use the get_deviation_of_mean_perc function to calculate outliers
fx.get_deviation_of_mean_perc(pd_fraud, list_numeric_var, target='isFraud', multiplier=3)
Out[18]:
0.0 1.0 variable sum_outlier_values porcentaje_sum_null_values
0 0.986447 0.013553 step 18004 0.017170
1 0.981773 0.018227 amount 19422 0.018522
2 0.999311 0.000689 oldbalanceOrg 26140 0.024929
3 1.000000 0.000000 newbalanceOrig 26282 0.025064
4 0.999427 0.000573 oldbalanceDest 22685 0.021634
5 0.998513 0.001487 newbalanceDest 22862 0.021803

As the results show, outliers exist. However, as the dataset is very unbalanced, the data will not be modified, in case these outliers help explain the results.
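The exact rule inside `fx.get_deviation_of_mean_perc` is not shown in this notebook; assuming it applies the usual 3-standard-deviations criterion, a minimal sketch on synthetic data would be:

```python
import numpy as np
import pandas as pd

# 1000 well-behaved values plus one extreme point
rng = np.random.default_rng(0)
s = pd.Series(np.concatenate([rng.normal(0, 1, 1000), [50.0]]))

# Flag values further than 3 standard deviations from the mean
mean, std = s.mean(), s.std()
outliers = s[(s - mean).abs() > 3 * std]
print(len(outliers))  # 1
```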

Correlations¶

Numerical variables¶

In [19]:
#plot the correlation matrix
fx.get_corr_matrix(dataset = pd_fraud[list_numeric_var], 
                metodo='pearson', size_figure=[10,8])
Out[19]:
0
In [20]:
#Calculate the numeric results of the correlation
corr = pd_fraud[list_numeric_var].corr('pearson')
new_corr = corr.abs()
new_corr.loc[:,:] = np.tril(new_corr, k=-1) # below main lower triangle of an array
new_corr = new_corr.stack().to_frame('correlation').reset_index().sort_values(by='correlation', ascending=False)
new_corr[new_corr['correlation']>0.6]
Out[20]:
level_0 level_1 correlation
53 newbalanceOrig oldbalanceOrg 0.999050
98 newbalanceDest oldbalanceDest 0.978401

There is a high correlation between the new and old balance of the origin account, and the same for the destination. However, as the target is very unbalanced, we will not modify them for now.

Isolating the target variable¶

In [21]:
#Select all the columns except the target variable
X_fraud = pd_fraud.drop('isFraud',axis=1)
In [22]:
#Select the target variable
y_fraud =pd_fraud['isFraud']

Categorical variable encoding¶

In [23]:
list_columns_cat = list(X_fraud.select_dtypes("category").columns)
In [24]:
#Conversion of categorical variables with one-hot encoder
ohe = ce.OneHotEncoder(cols=list_columns_cat)
model = ohe.fit(X_fraud, y_fraud)
In [25]:
model
Out[25]:
OneHotEncoder(cols=['type', 'device', 'zone', 'security_alert'])
In [26]:
X_fraud_t = model.transform(X_fraud, y_fraud)

# X_train_t.columns = list_columns_cat
# X_test_t.columns = list_columns_cat
In [27]:
len(list(X_fraud_t.columns))
Out[27]:
25
In [28]:
X_fraud_t.dtypes.to_dict()
Out[28]:
{'step': dtype('int64'),
 'type_1': dtype('int64'),
 'type_2': dtype('int64'),
 'type_3': dtype('int64'),
 'type_4': dtype('int64'),
 'type_5': dtype('int64'),
 'amount': dtype('float64'),
 'device_1': dtype('int64'),
 'device_2': dtype('int64'),
 'device_3': dtype('int64'),
 'device_4': dtype('int64'),
 'connection_time': dtype('float64'),
 'oldbalanceOrg': dtype('float64'),
 'age': dtype('int64'),
 'newbalanceOrig': dtype('float64'),
 'zone_1': dtype('int64'),
 'zone_2': dtype('int64'),
 'zone_3': dtype('int64'),
 'zone_4': dtype('int64'),
 'user_number': dtype('int64'),
 'user_connections': dtype('int64'),
 'security_alert_1': dtype('int64'),
 'security_alert_2': dtype('int64'),
 'oldbalanceDest': dtype('float64'),
 'newbalanceDest': dtype('float64')}
In [29]:
X_fraud_t.head()
Out[29]:
step type_1 type_2 type_3 type_4 type_5 amount device_1 device_2 device_3 ... zone_1 zone_2 zone_3 zone_4 user_number user_connections security_alert_1 security_alert_2 oldbalanceDest newbalanceDest
0 1 1 0 0 0 0 9839.64 1 0 0 ... 1 0 0 0 138 5 1 0 0.0 0.0
1 1 1 0 0 0 0 1864.28 1 0 0 ... 0 1 0 0 909 1 0 1 0.0 0.0
2 1 0 1 0 0 0 181.00 0 1 0 ... 1 0 0 0 2569 10 0 1 0.0 0.0
3 1 0 0 1 0 0 181.00 1 0 0 ... 0 1 0 0 1787 3 0 1 21182.0 0.0
4 1 1 0 0 0 0 11668.14 1 0 0 ... 0 1 0 0 3997 8 0 1 0.0 0.0

5 rows × 25 columns
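Note that `category_encoders` produces numbered columns (`type_1` … `type_5`). As a hedged alternative sketch, `pandas.get_dummies` keeps the category label in each column name, which can make the later explainability step easier to read (the values below are illustrative):

```python
import pandas as pd

# Toy categorical column with the transaction types seen above
df = pd.DataFrame({"type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"]})

# get_dummies names each column after its category value,
# unlike the numbered type_1 ... type_5 output of category_encoders
dummies = pd.get_dummies(df, columns=["type"])
print(sorted(dummies.columns))  # ['type_CASH_OUT', 'type_PAYMENT', 'type_TRANSFER']
```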

Selecting Training and Test data¶

In [30]:
#20% of the data will be used as test data. Due to the imbalance, we stratify the split
X_train, X_test, y_train, y_test = train_test_split(X_fraud_t, y_fraud, stratify=y_fraud, test_size=0.2, random_state=1)
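On a hypothetical toy target with the same ~1% positive rate, `stratify` keeps the fraud share identical in both splits:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target: 10 frauds out of 1000 records (~1%), mirroring the dataset
y = pd.Series([1] * 10 + [0] * 990)
X = pd.DataFrame({"feature": np.arange(len(y))})

# Stratified split allocates positives proportionally to each side
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.2, random_state=1)
print(y_tr.mean(), y_te.mean())  # 0.01 0.01
```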
In [31]:
X_test['isFraud'] = y_test
X_train['isFraud'] = y_train
In [32]:
#Check whether the split has similar results
X_train.describe().round(2)
Out[32]:
step type_1 type_2 type_3 type_4 type_5 amount device_1 device_2 device_3 ... zone_1 zone_2 zone_3 zone_4 user_number user_connections security_alert_1 security_alert_2 oldbalanceDest newbalanceDest
count 838860.00 838860.00 838860.00 838860.00 838860.00 838860.00 838860.00 838860.00 838860.00 838860.00 ... 838860.00 838860.00 838860.00 838860.0 838860.00 838860.00 838860.0 838860.0 838860.00 838860.00
mean 26.96 0.34 0.08 0.36 0.01 0.22 158843.80 0.35 0.30 0.25 ... 0.25 0.35 0.30 0.1 2530.30 5.50 0.1 0.9 977107.51 1113307.51
std 15.62 0.47 0.28 0.48 0.08 0.41 265086.90 0.48 0.46 0.43 ... 0.43 0.48 0.46 0.3 1425.54 2.87 0.3 0.3 2293281.74 2413068.46
min 1.00 0.00 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.0 59.00 1.00 0.0 0.0 0.00 0.00
25% 15.00 0.00 0.00 0.00 0.00 0.00 12129.44 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.0 1296.00 3.00 0.0 1.0 0.00 0.00
50% 20.00 0.00 0.00 0.00 0.00 0.00 76315.05 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.0 2532.00 6.00 0.0 1.0 126301.18 218243.75
75% 39.00 1.00 0.00 1.00 0.00 0.00 213818.56 1.00 1.00 0.00 ... 0.00 1.00 1.00 0.0 3763.00 8.00 0.0 1.0 916432.90 1150730.90
max 95.00 1.00 1.00 1.00 1.00 1.00 10000000.00 1.00 1.00 1.00 ... 1.00 1.00 1.00 1.0 5000.00 10.00 1.0 1.0 42054659.73 42169156.09

8 rows × 25 columns

In [33]:
X_test.describe().round(2)
Out[33]:
step type_1 type_2 type_3 type_4 type_5 amount device_1 device_2 device_3 ... zone_1 zone_2 zone_3 zone_4 user_number user_connections security_alert_1 security_alert_2 oldbalanceDest newbalanceDest
count 209715.00 209715.00 209715.00 209715.00 209715.00 209715.00 209715.00 209715.00 209715.00 209715.00 ... 209715.00 209715.00 209715.00 209715.0 209715.00 209715.00 209715.0 209715.0 209715.00 209715.00
mean 27.01 0.34 0.08 0.36 0.01 0.22 157959.66 0.35 0.30 0.25 ... 0.25 0.35 0.30 0.1 2530.98 5.50 0.1 0.9 982369.78 1117736.29
std 15.64 0.47 0.28 0.48 0.08 0.41 264355.71 0.48 0.46 0.43 ... 0.43 0.48 0.46 0.3 1429.94 2.87 0.3 0.3 2310717.60 2430450.92
min 1.00 0.00 0.00 0.00 0.00 0.00 0.14 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.0 59.00 1.00 0.0 0.0 0.00 0.00
25% 15.00 0.00 0.00 0.00 0.00 0.00 12236.51 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.0 1289.00 3.00 0.0 1.0 0.00 0.00
50% 20.00 0.00 0.00 0.00 0.00 0.00 76509.09 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.0 2532.00 5.00 0.0 1.0 126703.03 218308.07
75% 39.00 1.00 0.00 1.00 0.00 0.00 213526.00 1.00 1.00 0.00 ... 1.00 1.00 1.00 0.0 3771.00 8.00 0.0 1.0 914317.74 1147577.34
max 95.00 1.00 1.00 1.00 1.00 1.00 10000000.00 1.00 1.00 1.00 ... 1.00 1.00 1.00 1.0 5000.00 10.00 1.0 1.0 41259370.36 42054659.73

8 rows × 25 columns

Save the processed data as parquet files¶

In [34]:
#Create parquet files with the training and test data separated
X_train.to_parquet("../data/training_data.parquet")
X_test.to_parquet("../data/test_data.parquet")